Abstract

The ability to visually recognize objects is a fundamental skill for robotic
systems. Indeed, a large variety of tasks involving manipulation, navigation or
interaction with other agents depends heavily on an accurate understanding of the
visual scene. Yet, at present, robots lack good visual perceptual systems, and this
often becomes the main bottleneck preventing the use of autonomous agents in
real-world applications.

Lately in computer vision, systems that learn suitable visual representations
using multi-layer deep convolutional networks have shown remarkable
performance in tasks such as large-scale visual recognition and image retrieval.
In this regard, it is natural to ask whether such remarkable performance would
also generalize to the robotic setting.

In this paper we investigate this possibility, while taking further steps in
developing a computational vision system to be embedded on a robotic platform, the
iCub humanoid robot. In particular, we release a new dataset (iCubWorld28)
that we use as a benchmark to address the question: how many objects can iCub
recognize? Our study is developed in a learning framework which reflects the
typical visual experience of a humanoid robot like the iCub. The experiments offer
interesting insights into the strengths and weaknesses of current computer vision
approaches applied in real robotic settings.


1 Introduction 

Visual perception is arguably one of the most important sensory channels for robotic
systems that should operate in human environments. Indeed, the lack of good visual
information becomes a major bottleneck for almost any task in which the robotic agent
might be engaged, from simple manipulation to complex behaviors involving planning
and navigation.

Figure 1: Setup used to collect the iCubWorld28 dataset.

In recent years, computational vision systems have witnessed tremendous progress,
especially in the context of object recognition. Such progress has been mainly driven
by the development of machine learning methods for representing and classifying
images, based on multi-layer (deep) architectures (see [?], [10], [24] and, more
recently, [29], [?], [?]). An important reason for the rapid evolution of this kind of
system was the acquisition of large public datasets on which to train and benchmark
the performance of new solutions, e.g. Caltech256 [?], PascalVOC [?] and ImageNet
LSVRC [?]. All these datasets are essentially tailored to image retrieval problems
and indeed this is the kind of task on which the performance of many vision systems
has ultimately been tested. It is then natural to ask to what extent these new
developments can impact robotic systems, where the vision tasks of interest differ from
the typical retrieval scenario.

The iCub humanoid [22] (see Figure 1) offers an ideal platform to address the above
question, which we began to investigate in [?]. In particular, we started collecting and
making available a dataset (iCubWorld) that reflects the typical visual experience
of iCub and testing different solutions for visual recognition. Our preliminary results
confirmed on the one hand the potential of recently proposed systems, and on the other
highlighted the challenges posed by the specific robotics context, in particular the lack
of accurate supervision.

The current paper builds on our previous work to take a further step in the
development of a computational vision system for the iCub. In particular, in this paper
we conduct an empirical study aimed at answering the question:

“How many objects can iCub recognize today?” 

We consider this problem within the Human-Robot Interaction scenario proposed in [4]
for the acquisition of the iCubWorld dataset. In the current work, a human teacher
shows 28 different objects to the iCub, verbally annotating them using a speech
recognition system to provide labeling. The same procedure is repeated for four
consecutive days, leading to the acquisition of a new dataset, dubbed iCubWorld28.

We break the above broad question down into sub-problems that we describe in
Sec. 2 and address empirically in Sec. 5. We devote Sec. 3 to providing some background
on the general problem of learning visual representations. In Sec. 4 we first review the
robotic application that we designed to perform the acquisition of iCubWorld28
and then we describe the image representation pipeline employed in our experiments.
Finally, Sec. 6 concludes the work, laying the foundations for future research.


2 An ideal robotic visual recognition system 


By asking “How many objects can iCub recognize?”, in this work we aim to investigate
the problem of achieving human-like visual recognition capabilities in robotics.
To address this question we divide our analysis into multiple points:

Reliability. In order to be reproducible, our analysis will be performed off-line on a
visual recognition dataset directly acquired from the robot cameras, iCubWorld28.
However, in order to generalize the recognition performance observed on such a
benchmark, we will need a measure able to quantify the confidence with which we can
expect such results to hold also in the real-world application. In Sec. 5.1 we propose a
possible approach to this problem.

Contextual Information. The robotic setting offers a great deal of contextual
information that could be incorporated in the learning system to improve recognition
performance. For instance, by observing an object from different points of view, the
robot could better disambiguate between different classes. Typically, contextual
information is not available in standard computer vision settings and therefore it is
unclear in general how to employ it in recognition. In Sec. 5.2 we start addressing this
question.


Learning incrementally. A human-like artificial system should be able to learn a richer
model of the world as new observations become available. Specifically, it is natural to
expect that the visual recognition system of a humanoid robot should benefit from the
incorporation of visual data acquired on multiple occasions, such as training sessions
across multiple days. A preliminary analysis of such an incremental setting is
performed in Sec. 5.3 and represents a first step towards a true life-long learning system
that continuously updates its internal model of the world.





Self-Supervision. Ideally, the interaction between a human and a robot should take
place along natural communication channels (for the human), such as speech or vision.
Clearly, such a scenario limits the amount of supervision that a human teacher can
provide to the robot. For instance, in the human-robot application considered in this
work, images cannot be manually segmented around the object of interest and therefore
the system has to rely on so-called “weak” or “self-” supervised strategies, such as
motion segmentation, to eliminate, at least partially, the visual distractors (e.g.
background or other objects). In Sec. 5.4 we investigate this problem, evaluating the
impact of having a finer (or coarser) segmentation for the images in iCubWorld28.

3 Learning to Represent and Classify Objects 

Modern vision algorithms for recognition/categorization rely on machine learning
routines to identify a suitable representation for the visual data. Ideally, such a
representation (usually encoded in a real vector of finite dimension) should be on the
one hand discriminative, in the sense that images depicting different objects should be
easily separable, and on the other hand invariant to physical transformations of the
scene (such as translations, rotations or deformations) that do not affect the actual
object class.

These methods are typically composed of two or more layers alternating convolution
and non-linear mappings of local image patches; the training process usually consists
in the optimization of the weights in the convolution stage with respect to a given
loss function (e.g. reconstruction error), separately or jointly for all layers. Several
approaches were proposed to learn the local filters at the convolution level, such as
Bag of Words [10], Sparse Coding [33], Fisher Vectors [25] or HMAX [?]. Once the
representation map has been learned, each novel image is mapped to the new space,
where a classifier is trained using standard techniques for supervised learning such as
SVM [28], Regularized Least Squares (RLS) [?] or Neural Networks (NN) [?].

Lately, the availability of extremely large image datasets and parallel computing
resources, such as general-purpose GPUs, has made it possible to train deep
architectures such as Convolutional Neural Networks (CNNs) on all layers
simultaneously. According to recent empirical evidence [?], [?], [3], [?] it appears that
architectures trained on such a rich amount of visual information are able to develop
extremely powerful representations, and therefore can also be used on novel datasets as
a “black-box” for image description. This approach is particularly appealing and is the
one evaluated in this paper. Indeed, at present, effectively training such complex
architectures from scratch still requires very large numbers of examples, long
computation times and, not least, the know-how to accurately tune their parameters, all
factors that are not trivial in robotics settings where the application context may not
be known a priori.

Other lines of research for visual recognition are based on keypoint matching
techniques [11], [26], [?]. Although often employed in robotics settings [?], these
approaches are not particularly suited to applications where supervision is not accurate.
Indeed, we previously observed in [?] that, when employed in natural Human-Robot
Interaction scenarios where supervision is weak, the performance of these methods
degrades remarkably. Hence in this work we do not cover these and related approaches
and instead we focus on representation learning methods.

Figure 2: The visual recognition system adopted in this work and currently
implemented on iCub.

4 Setup and Acquisition 

In this section we describe the image acquisition protocol used to collect the
iCubWorld28 dataset, as well as the implementation details of the visual recognition
framework employed for the experimental analysis discussed in Sec. 5.

Setup. The application setup we employed in this work is analogous to the one
described in [?], which we briefly outlined in the introduction of this paper: a human
supervisor stands in front of the iCub robot and shows it different objects while
verbally providing the class annotation (Figure 1 depicts a typical acquisition setting).
Exploiting independent motion detection routines [?], the robot tracks the novel object
while acquiring images at 33 Hz. The independent motion detection algorithm provides
an approximate localization of the object, effectively reducing the image size
from 320 x 240 pixels to a mean of about 120 x 120 pixels (see Fig. [?] for an example).
Cropped images are then processed by a representation module (described later in this
section) that encodes the visual information into a single vector, or descriptor, that is
then used for classification (Figure 2).

Figure 3: Example images from one of the 4 datasets comprising iCubWorld28. As
can be seen in the figure, each dataset is composed of 28 objects organized into 7
categories.

Acquisition. Within the setting described above, we collected the iCubWorld28
dataset, which comprises images of 28 distinct objects evenly organized into 7
categories (see Figure 3). For each object in the dataset, we acquired separate train and
test sets during sessions of 20 seconds each. We reduced the acquisition frequency by
a factor of 3 (i.e. acquiring one image roughly every 0.09 seconds) to lower the
computational costs of the learning process. Thus, after each session, we collected 220
train and 220 test images for each of the 28 objects. To assess the incremental learning
performance of the iCub visual recognition system (see the discussion in Sec. 5.3) we
repeated this same acquisition protocol for 4 consecutive days, ending up with four
datasets (Day 1 to Day 4) of more than 12k images each and about 50k images in total.
We will make this release available to the community at the same web address as the
previous iCubWorld.
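
The dataset sizes quoted above follow directly from the acquisition parameters. The short sketch below simply reproduces that arithmetic; the variable names are illustrative and are not part of the original acquisition software.

```python
# Acquisition parameters as stated in the text.
camera_rate_hz = 33        # frame rate while the robot tracks the object
subsampling_factor = 3     # one frame kept out of every three
session_seconds = 20       # duration of one train (or test) session
num_objects = 28
num_days = 4

effective_rate_hz = camera_rate_hz / subsampling_factor   # 11 frames per second
seconds_per_frame = 1.0 / effective_rate_hz               # roughly 0.09 s
images_per_session = round(effective_rate_hz * session_seconds)

# Each object has one train session and one test session per day.
images_per_day = images_per_session * 2 * num_objects
images_total = images_per_day * num_days

print(images_per_session, images_per_day, images_total)
```

Under these assumptions each session yields 220 images, each day 12,320 images, and the four days together 49,280 images, in line with the "more than 12k" and "about 50k" figures in the text.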

Extracting visual representation. To extract visual representations of images
acquired from iCub’s cameras, in this work we employed a CNN originally trained on
the ImageNet dataset [?]. Specifically, we employed a model provided in Caffe’s
library [?], BVLC Reference CaffeNet, which is available on-line and is based on the
well-established network proposed in [20]. Following the strategy proposed in [?], [?],
[?], we employed the CNN as a black-box module that takes images as input and returns
their corresponding vector representations as output.

Learning. In visual recognition settings, the typical approach to classification is to
employ supervised learning methods such as Support Vector Machines or
Regularized Least Squares (RLS). In this work we rely on the GURLS [?] machine
learning library to perform RLS. Indeed, as empirically observed in previous work
on the iCub [?], RLS exhibited comparable or even better results than Support Vector
Machines (using the LIBLINEAR [?] library). Moreover, the rank-one update rule for
matrix inversion [?] provides a natural variant of the classic RLS algorithm for the
setting in which training data is provided incrementally to the system (the incremental
RLS algorithm is also implemented in the GURLS library). Clearly, this is a typical
scenario in robotics applications and, as already mentioned in Sec. 2, is a topic of
interest in this work (see Sec. 5.3).
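
To make the rank-one update concrete, the following is a minimal sketch of an incremental RLS classifier based on the Sherman-Morrison formula. It is not the GURLS implementation: the class name `IncrementalRLS`, the +1/-1 label encoding and the regularization parameter `reg` are assumptions made for illustration.

```python
import numpy as np

class IncrementalRLS:
    """Regularized least squares classifier updated one example at a time."""

    def __init__(self, dim, num_classes, reg=1.0):
        # Inverse of (X^T X + reg * I); equals I / reg before any data is seen.
        self.A_inv = np.eye(dim) / reg
        # Running sum X^T Y, with labels encoded as +1 (true class) / -1 (others).
        self.b = np.zeros((dim, num_classes))
        self.num_classes = num_classes

    def update(self, x, label):
        y = -np.ones(self.num_classes)
        y[label] = 1.0
        Ax = self.A_inv @ x
        # Sherman-Morrison rank-one update of the inverse (A_inv is symmetric).
        self.A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
        self.b += np.outer(x, y)

    @property
    def weights(self):
        return self.A_inv @ self.b

    def predict(self, X):
        # Predicted class is the one with the highest linear score.
        return np.argmax(X @ self.weights, axis=1)
```

After processing all examples one by one, the weights coincide (up to numerical error) with the batch solution of the same regularized problem, while each update costs only a matrix-vector product and an outer product instead of a full matrix inversion.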


5 A data sheet of iCub’s visual recognition capabilities 


In this section we empirically address the questions raised in Sec. 2 with the aim of
providing the reader with an ideal “data sheet” of iCub’s current visual capabilities.
The guiding principle of our analysis is to answer the generic (and intentionally fuzzy)
question “How many objects can iCub recognize?” where, with the word “recognize”,
we refer to human-level visual capabilities. Indeed, in realistic robotic applications we
need reliable perceptual systems that, at least for limited sets of objects, are virtually
infallible.


In order to investigate this problem, in Sec. 5.1 we first introduce and discuss a
possible way to measure the confidence with which the classification accuracy achieved
by systems trained on our benchmark dataset, iCubWorld28, is expected to generalize
during a generic run of the human-robot interaction application described in Sec. 4.
Then, in Secs. 5.2, 5.3 and 5.4 we consider natural approaches to improve
recognition in the robotic setting, identifying possible future directions for research.
In Sec. 5.6 we briefly report a comparison of different visual recognition systems for
reference, while in Sec. 5.5 we provide a preliminary answer to the question motivating
this work.


5.1 Reliability and Scalability 

Ideally, a reliable recognition system should be robust with respect to the set of
objects it has to discriminate. In other words, we would like the classification
performance of a predictor not to vary dramatically when we change the set of classes on
which it is trained/tested. This problem is particularly relevant to this work since the
main goal of our analysis, although limited to the dataset of 28 objects described in
Sec. 4, is to offer insights on the expected recognition capabilities of iCub for any
choice of objects.

Figure 4: Empirical estimation of the probability distribution $P(\mathrm{acc} = A \mid t)$
for a predictor trained on a random set of t objects to have accuracy A.

Therefore, to quantitatively measure the reliability of the visual system currently
available on iCub, we performed multiple classification experiments for different
subsets of classes in iCubWorld28, using the dataset corresponding to Day 1. More
precisely, for any t = 2, ..., 26 we randomly selected about 400 different combinations
of t object classes among the available 28 (to avoid the combinatorial explosion of
$\binom{28}{t}$ experiments) and trained/tested the learning system described in Sec. 4
on the corresponding reduced datasets. As a measure of performance for the resulting
predictor we computed its average accuracy, namely the ratio of correct guesses to the
cardinality of the whole test set. For a fixed t, we interpreted the accuracy measured
for each individual experiment as one observation sampled from
$P(\mathrm{acc} = A \mid t)$, namely the conditional probability that a predictor trained
on a randomly chosen set of t classes would achieve accuracy equal to a value A between
0 and 1. In Fig. 4 we report the empirical estimation of this distribution together with
the associated empirical mean (white curve) and one standard deviation (gray region).
Specifically, each column in the plot approximates $P(\mathrm{acc} = A \mid t)$ as the
normalized histogram of the accuracies measured for a fixed t and is depicted as a
vertical sequence of balls with radius directly proportional to the corresponding bin
value.
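
The sampling procedure described above can be sketched as follows. This is only an illustration of the protocol under stated assumptions: `train_and_test` is a hypothetical callable standing in for the full train/test pipeline, and the number of histogram bins is an arbitrary choice.

```python
import random
from collections import Counter
from math import comb

def sample_class_subsets(num_classes, t, num_subsets, seed=0):
    """Draw distinct random subsets of t classes (at most num_subsets of them)."""
    rng = random.Random(seed)
    # Only comb(num_classes, t) subsets exist, so cap the request accordingly.
    target = min(num_subsets, comb(num_classes, t))
    subsets = set()
    while len(subsets) < target:
        subsets.add(tuple(sorted(rng.sample(range(num_classes), t))))
    return sorted(subsets)

def accuracy_histogram(accuracies, num_bins=20):
    """Normalized histogram over [0, 1]: an empirical estimate of P(acc = A | t)."""
    counts = Counter(min(int(a * num_bins), num_bins - 1) for a in accuracies)
    return [counts[b] / len(accuracies) for b in range(num_bins)]

def estimate_distribution(train_and_test, t, num_classes=28, num_subsets=400):
    """train_and_test is a hypothetical callable: subset of class ids -> accuracy."""
    subsets = sample_class_subsets(num_classes, t, num_subsets)
    return accuracy_histogram([train_and_test(s) for s in subsets])
```

Note that for t = 2 or t = 26 only 378 distinct subsets of 28 classes exist, which is why the sketch caps the number of requested subsets.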

Apart from the expected drop in accuracy that we observe when the cardinality of
the multi-class problem increases, this analysis provides us with useful insights. First,
notice that the slope of the mean accuracy reported in Fig. 4 (white curve) flattens
remarkably as the number of classes increases (e.g. after t = 10), suggesting
that such a negative effect should become less and less disruptive as we learn new
objects. To confirm this trend and further investigate the behavior of such a recognition
system, in the near future we will extend our analysis to a larger version of
iCubWorld28, containing more object classes and categories. Indeed, one of the guiding
principles behind the iCubWorld project is to collect a dataset in constant
expansion whose incremental growth would retrace the natural experience of a physical
agent that explores an unknown environment and discovers new objects.

Figure 5: Confidence intervals for predictors trained on a randomly sampled set of
objects. For a fixed number of objects t, the value on a curve C represents the minimum
accuracy that we are guaranteed to achieve with the trained predictor, with confidence
C.

Second, notice that for each fixed cardinality t, the distribution of accuracies
$P(\mathrm{acc} = A \mid t)$, measured across the multiple trials, is clearly concentrated
around its mean. More specifically, this means that in general we can expect with high
confidence that a predictor trained on a randomly selected set of t objects would have
accuracy within ±5% of the mean of $P(\mathrm{acc} = A \mid t)$. This offers a useful
perspective on what recognition performance we should expect during a typical run of
the human-robot interaction application described in Sec. 4, ideally for any random
selection of t objects. To better quantify the expected capability of the system to
generalize its performance, in Fig. 5 we report the minimum accuracy that we are
guaranteed to achieve within specified levels of confidence. In this context, the
confidence c(A, t) for a given accuracy A and number of classes t was measured as

$$c(A, t) = \int_A^1 P(\mathrm{acc} = a \mid t)\, da \qquad (1)$$

and in Fig. 5 we report the confidence level curves c(A, t) = C for different values of
C. Such curves denote the minimum accuracy A guaranteed for a classifier trained on
a random set of t objects. To better understand the implications of this analysis, let us
consider for instance the blue curve in Fig. 5, corresponding to 95% and passing through
t = 15 and A = 0.75: with high probability (95%) and for a random choice of 15 objects,
the resulting predictor is guaranteed to achieve at least 0.75 classification accuracy.
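
On a finite set of measured accuracies, the integral in Eq. 1 reduces to counting the fraction of trials whose accuracy is at least A. The sketch below shows one possible empirical version of c(A, t) and of the guaranteed accuracy at a confidence level C; the function names are illustrative, and the quantile rule is an assumption rather than the exact procedure used to draw Fig. 5.

```python
import math

def confidence(accuracies, A):
    """Empirical c(A, t): fraction of trials with accuracy at least A."""
    return sum(a >= A for a in accuracies) / len(accuracies)

def guaranteed_accuracy(accuracies, C):
    """Largest observed accuracy A such that confidence(accuracies, A) >= C."""
    ranked = sorted(accuracies, reverse=True)
    # The k-th largest value is reached or exceeded by at least k trials.
    k = max(math.ceil(C * len(ranked)), 1)
    return ranked[k - 1]
```

Evaluating `guaranteed_accuracy` for every t at a fixed C traces one confidence level curve of the kind shown in Fig. 5.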

This result, and its corresponding visualization in Fig. 5, is of particular use from
a practical perspective since it can be employed as a reference “data sheet” to train
the iCub. Indeed, depending on the desired confidence C and the number t of objects
we want the robot to discriminate, Fig. 5 tells us the approximate level of
accuracy that we can expect to achieve with the classifier that we will train.

Figure 6: Improvement of the classification accuracy with respect to an increasingly
large temporal filtering window. Results are shown for fixed confidence level C = 80%
(see Eq. 1 and Fig. 5), plotted against the number of objects.

5.2 Exploiting Contextual Information 

The classification performance reported in Fig. 5 is clearly not comparable to the
human-level accuracy that we would expect on the problem considered. Indeed, even
for relatively low confidence values such as 80% (black curve), we observe a fast decay
of the guaranteed accuracy, which falls under the 0.9 threshold just after 4 objects.

A viable approach to mitigate this problem relies on noticing that the robotic setting
offers a great deal of prior and contextual information that could remarkably improve
performance. In this regard, let us consider the natural assumption that the class of an
object does not change while the robot observes it from multiple points of view. In such
a setting, given a set of w images (acquired from different viewpoints around the object
of interest) and a trained classifier with a (per-frame) accuracy A, we can consider a
new classification rule that combines the individual predictions on the set of w frames
into a global label. For instance, if we assume that the w images are sampled i.i.d.,
the rule returning the label that occurred in more than half of the frames (at least
50% + 1 times) would correctly classify the object with probability (or accuracy)

$$P = \sum_{k = \lfloor w/2 \rfloor + 1}^{w} \binom{w}{k} A^k (1 - A)^{w - k} \qquad (2)$$

In principle, this strategy could be extremely beneficial: suppose for instance that the
trained predictor has a per-frame accuracy of A = 0.7. Then, even for small sets (or
windows) of just 3 images we would have an improved classification accuracy of 0.78,
while for a larger w = 21 we would achieve an impressive 0.97.
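
The values quoted above can be checked by evaluating Eq. 2 directly. The function below is a minimal sketch; its name is chosen for illustration only.

```python
from math import comb

def majority_vote_accuracy(A, w):
    """Probability that more than half of w i.i.d. frames are classified
    correctly, given per-frame accuracy A (Eq. 2)."""
    return sum(
        comb(w, k) * A**k * (1 - A) ** (w - k)
        for k in range(w // 2 + 1, w + 1)
    )

print(round(majority_vote_accuracy(0.7, 3), 3))
print(round(majority_vote_accuracy(0.7, 21), 3))
```

For w = 3 the sum has two terms, 3 · 0.7² · 0.3 + 0.7³ = 0.784, which matches the 0.78 reported in the text; for w = 1 the formula reduces to the per-frame accuracy A.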

We evaluated the approach described above on iCubWorld28. In particular,
since in our setting the samples are acquired as a stream of consecutive images and we
are interested in on-line recognition, we chose to classify windows selecting the current
frame together with the previous w - 1 ones. This approach could be interpreted as a
sort of label-filtering process that suppresses “flickering” one-frame misclassifications.
Clearly, in this case Eq. 2 represents only an upper bound to the actual improvement
that we can expect, since the i.i.d. assumption never holds (consecutive images in a
stream are of course always correlated).

Figure 6 reports the effect of the label-filtering approach on the confidence curve
associated with C = 80% introduced in Fig. 5. We varied the size of the temporal
window from 0 (instantaneous) to 4 seconds, corresponding to a range of w between 1 and
50 frames. Notice that even in this non-i.i.d. scenario, the system performance clearly
benefits from smoothing, in particular when several classes are considered. This is
probably due to the fact that, as the number of objects to discriminate grows, the chance
of short-lived “one-frame” misclassifications increases proportionally.
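
The label-filtering process over a stream of per-frame predictions can be sketched as a sliding-window vote. The sketch below uses the most frequent label in the window (a plurality vote), which is an assumption made for simplicity and is slightly more permissive than the strict "50% + 1" rule of Eq. 2; the function name is illustrative.

```python
from collections import Counter, deque

def filter_labels(frame_labels, w):
    """Replace each per-frame label by the most frequent label among the
    current frame and the previous w - 1 frames."""
    window = deque(maxlen=w)
    filtered = []
    for label in frame_labels:
        window.append(label)
        # Ties go to the label that entered the window first.
        filtered.append(Counter(window).most_common(1)[0][0])
    return filtered

# A single misclassified frame in the middle of the stream is suppressed.
print(filter_labels(["cup", "cup", "mug", "cup", "cup"], w=3))
```

With w = 1 the filter returns the per-frame predictions unchanged, which corresponds to the instantaneous setting in Fig. 6.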


5.3 Incremental Learning: A week (almost) with iCub 


The temporal filtering strategy considered in Sec. 5.2 leads to an impressive boost in
recognition accuracy. However, if we consider the original goal of achieving
human-level performance on iCubWorld (say, for reference, 0.98 accuracy), we notice from
Fig. 6 that even for a relatively low confidence value of 80% the system still falls
significantly short of this target. In this regard, in this section we take into account
another aspect of robotics settings that could in principle improve the recognition
capabilities of the system, namely the ability to learn incrementally.

Indeed, the robotic scenario is naturally suited to life-long learning applications.
Specifically, in visual recognition settings, novel training evidence could be provided to
the robot incrementally (and in principle, indefinitely) in order to update its knowledge
as the task requires. A first result, which empirically quantifies the importance of
learning incrementally and motivates the experimental analysis of this section, is
reported in Fig. 7. We consider the experimental setting introduced in Sec. 5.1 and
report the curve associated with 80% confidence for classifiers trained on an increasing
number of examples per class. As can be noticed, the incremental growth of the training
data has a remarkable impact on the overall classification performance and raises the
question of what the long-term effects of such a learning process would be on the
system’s recognition capabilities.

Figure 7: Classification accuracy for fixed level of confidence C = 80% (see Eq. 1)
and an incremental number of training examples per class.


We recall that iCubWorld28 is a dataset collected during 4 separate days and
that for each day both a training and a test set were acquired. We further recall that
all experiments discussed so far were performed on a single day of iCubWorld28,
namely Day 1. In order to study the impact of incremental learning on visual
recognition, in the following we also take into account the remaining 3 days. In
particular, we considered the learning setting in which we trained a classifier
incrementally on the training sets of the first three days of iCubWorld28 and then
evaluated it on the test set of the fourth “unseen” day. To reduce the amount of
computation we focused only on the problem of correctly classifying the 28 objects in the
dataset and report the measured accuracy in Fig. 8 for classifiers trained starting
respectively from Day 1 (blue), Day 2 (orange) and Day 3 (yellow). On the one hand, we
notice that when provided only with training data acquired from a single day, the
incremental learning accuracy exhibited by the predictors follows a remarkably similar
pattern for all days, suggesting that the three datasets contain a similar amount of
information. On the other hand, we observe that while all these curves seem to saturate
around 0.65 accuracy, adding data from a new day allows the system to overcome this
limitation, improving the overall system performance (here we refer to the “jumps”
observed for both the blue and orange curves as they switch between days).

Figure 8: Incremental learning on iCubWorld28. Blue, orange and yellow curves
identify the classification accuracy of predictors trained incrementally starting from,
respectively, Day 1, Day 2 and Day 3. We used the test set from Day 4 to assess the
generalization performance of the classifiers.

The results reported in Fig. 8 seem to suggest that training across multiple days
is more beneficial than training during a single session because it exposes the system
to less redundant information. To confirm this observation we considered a further
experimental scenario where we compared the performance of a predictor trained on
data acquired from all days with the accuracy achieved by four other classifiers, each
trained on a different day of iCubWorld28 taking the first 100 examples per class.
In order to compare problems of identical dimension, the “mixed” dataset was created
by taking the first 25 samples (per class) from the training set of each day. Table 1
reports the resulting classification accuracy tested separately on each day. In line with
the original intuition, we notice that the predictor trained on the mixed dataset clearly
outperforms the others on average. However, it is of particular interest to observe that
even on a single-day basis, the predictor trained on all days (and thus less exposed to
redundant information) outperforms predictors trained and tested on the same day.
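
The construction of the “mixed” training set can be sketched as follows. The data layout, a list of (feature, label) pairs per day in acquisition order, and the function names are assumptions made for illustration.

```python
def first_n_per_class(samples, n):
    """Keep the first n (feature, label) pairs of each class, preserving order."""
    kept, seen = [], {}
    for feature, label in samples:
        if seen.get(label, 0) < n:
            kept.append((feature, label))
            seen[label] = seen.get(label, 0) + 1
    return kept

def build_mixed_training_set(days, per_day=25):
    """Concatenate the first per_day samples of each class from every day."""
    mixed = []
    for day_samples in days:
        mixed.extend(first_n_per_class(day_samples, per_day))
    return mixed
```

With four days and 25 samples per class per day, the mixed set contains 100 examples per class, the same size as a single-day training set built with `first_n_per_class(day, 100)`, so the two problems have identical dimension.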

5.4 Supervision and clutter 

An important component of the visual recognition framework considered in this work
is the motion detection routine that performs the preliminary crop around the object
in the image (see Sec. 4). In this section we investigate the actual impact of such a
strategy by comparing it with two other approaches. On the one hand, we consider the
setting where we take the whole image as input and no crop is performed, in order to
understand the benefits of our method; on the other hand, we manually fix the bounding
box around the object, to evaluate what could still be gained in terms of performance.
Fig. 9 reports an example of these strategies, considering the dataset acquired on Day 1.

                     TEST Accuracy (%)
TRAIN       Day 1   Day 2   Day 3   Day 4   Average
Day 1       67.7    41.9    37.2    67.2    53.5
Day 2       40.1    67.8    35.4    66.8    57.5
Day 3       62.0    63.5    66.4    64.9    64.2
Day 4       62.9    64.1    65.3    67.1    64.8
All Days    73.4    71.0    68.1    68.9    70.3

Table 1: Accuracy of predictors trained on single days compared with a predictor
trained on all days together. For a fair comparison, the training datasets have the same
size (100 examples per class).

Figure 9: Different supervision strategies evaluated in this work. From left to right:
whole image (no segmentation), large (220 x 220 px) and small (120 x 120 px) bounding
boxes cropped around the object of interest using motion detection (see Sec. 4), and
manual segmentation.

In Table 2 we report the classification accuracy of recognition systems trained on
images cropped according to the strategies introduced above. We can notice that
motion detection already provides a remarkable boost in performance with respect to
taking the whole image, thus suggesting that the presence of the background has a
disruptive effect. This result is actually surprising considering that the typical training
data used for large image retrieval tasks such as ImageNet [?] often depict large
portions of the background, as in our case. We point out that manual cropping provides
further benefits to the classification accuracy. Although this strategy is not applicable
to real robotics settings, this result encourages the development of finer approaches to
object localization that would eventually lead to similar performance.






                     TEST Accuracy (%)
TRAIN       Image   Crop 1  Crop 2  Manual
Image       50.6    48.8    36.3    20.6
Crop 1      50.3    62.2    57.7    24.9
Crop 2      30.1    50.8    73.9    28.7
Manual      6.8     8.9     12.2    81.7

Table 2: Comparison of the classification accuracy achieved by recognition systems
trained on iCubWorld28 for different levels of supervision: whole image, crop 1
(220 x 220 px), crop 2 (120 x 120 px) and manual segmentation. See Fig. 9 for examples.

Confidence   98%   90%   80%   70%   50%
# Objects    2     4     6     7     14

Table 3: The maximum number of objects that iCub is able to recognize with 0.98
accuracy.


5.5 How many objects can iCub recognize?

We finally come back to the original question regarding the maximum number of objects
that iCub can recognize with the visual recognition system described in this paper.
Specifically, we are interested in achieving human-level performance on iCubWorld28,
which we set, for reference, to a high value of accuracy: 0.98. Table 3 provides the
answer to this question, returning the maximum number of objects that can be recognized
with accuracy of 0.98 for varying levels of confidence. Overall we observe that only a
few objects are actually recognized with high confidence. This result shows that modern
visual recognition approaches have actually opened the way to answer the ambitious
question we asked in this work, but also that the problem is far from being solved.
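
To make the reading of Table 3 concrete, the sketch below shows one way such a table
could be computed. It assumes, as a simplification, that confidence is estimated as the
fraction of randomly drawn object subsets of a given size on which the classifier reaches
the target accuracy; the precise definition used in our experiments is the one given
earlier in the paper. The function name `max_objects_at_confidence` and the toy
accuracies are hypothetical and are not measurements from iCubWorld28.

```python
def max_objects_at_confidence(trial_accuracies, target_accuracy, confidence):
    """Largest number of objects whose trials meet the target often enough.

    trial_accuracies maps a number of objects to the accuracies observed over
    repeated random trials with that many objects (assumed data layout).
    """
    best = 0
    for num_objects in sorted(trial_accuracies):
        accs = trial_accuracies[num_objects]
        # Fraction of trials that reach the target accuracy.
        fraction_ok = sum(a >= target_accuracy for a in accs) / len(accs)
        if fraction_ok < confidence:
            break  # stop at the first size that fails the confidence level
        best = num_objects
    return best

# Toy accuracies, invented for illustration only.
toy_trials = {
    2: [0.99, 1.00, 0.98, 0.99],
    3: [0.99, 0.97, 0.98, 1.00],
    4: [0.95, 0.99, 0.96, 0.98],
}

for conf_level in (0.9, 0.7, 0.5):
    print(conf_level, max_objects_at_confidence(toy_trials, 0.98, conf_level))
```

As in Table 3, lowering the required confidence can only increase (or leave unchanged)
the number of objects returned.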

5.6 Comparison with other Visual Architectures 

For completeness, we close this work by providing a brief comparison with other methods
for visual recognition. In Table 4 we report the classification accuracy of systems
trained/tested on the different days of iCubWorld28. The following architectures
for visual representation learning were evaluated: Bag of Words (BoW) [10], Sparse
Coding [33], Fisher Vectors [25], VLAD, PHOW [2] and the OverFeat implementation [29]
of a Convolutional Neural Network. Due to space limitations we refer the reader to the
original papers for more information about these methods. However, we point out that
these approaches can be divided into two groups: pre-trained “deep” architectures (the
CNNs CaffeNet and OverFeat) and single-layer “shallow” representations (the remaining
methods), where the “dictionary learning” stage was carried out on a subset of the
training set of iCubWorld28. As can be noticed, pre-trained CNNs clearly outperform the
others, and this was the main reason for the choice of CaffeNet for our experiments.

                              TEST Accuracy (%)
                        Day 1    Day 2    Day 3    Day 4    Avg.
PHOW [2]                 42.5     39.0     34.6     39.0    44.1
BoW [10]                 44.9     40.8     35.3     38.8    41.1
Sparse Coding [33]       29.2     24.1     21.9     23.7    30.6
HMAX                     30.5     27.3     25.4     23.7    32.8
Fisher Vectors [25]      47.3     44.7     41.5     44.3    48.6
VLAD                     44.2     40.0     35.0     38.1    44.5
CaffeNet [19]            75.9     70.9     71.9     73.9    80.8
OverFeat [29]            66.8     57.5     57.7     60.0    68.3

Table 4: Comparison of several architectures for visual representation learning applied
to the visual classification problem of iCubWorld28. Modern Convolutional Neural
Networks (the CaffeNet used in this work and OverFeat) clearly outperform previous
methods.
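
For readers unfamiliar with the “dictionary learning” stage of the shallow
representations in this comparison, the following is a minimal sketch of a Bag of Words
encoder: a dictionary is learned with a few k-means iterations over local descriptors,
and an image is then encoded as a normalized histogram of nearest dictionary atoms. The
function names, the descriptor dimensionality, the dictionary size and the random
descriptors are assumptions made for illustration; they do not reproduce the settings
used for the results in Table 4.

```python
import numpy as np

def learn_dictionary(descriptors, num_atoms, num_iters=20, seed=0):
    """Learn a dictionary with plain k-means (illustrative, not optimized)."""
    rng = np.random.default_rng(seed)
    # Initialize the atoms with randomly chosen training descriptors.
    picks = rng.choice(len(descriptors), num_atoms, replace=False)
    atoms = descriptors[picks].astype(float)
    for _ in range(num_iters):
        # Assign each descriptor to its nearest atom (squared Euclidean distance).
        dists = ((descriptors[:, None, :] - atoms[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for k in range(num_atoms):
            members = descriptors[assign == k]
            if len(members) > 0:  # keep the previous atom if its cluster is empty
                atoms[k] = members.mean(axis=0)
    return atoms

def encode_bow(descriptors, atoms):
    """Encode one image as a normalized histogram of nearest atoms."""
    dists = ((descriptors[:, None, :] - atoms[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(dists.argmin(axis=1), minlength=len(atoms)).astype(float)
    return counts / counts.sum()

# Random stand-ins for local descriptors extracted from training images.
rng = np.random.default_rng(0)
train_descriptors = rng.normal(size=(500, 16))
atoms = learn_dictionary(train_descriptors, num_atoms=8)

# Descriptors of a single (hypothetical) test image.
image_descriptors = rng.normal(size=(50, 16))
histogram = encode_bow(image_descriptors, atoms)
```

The resulting histogram would then be fed to a standard classifier; the pre-trained
CNNs skip this stage because their representation is learned beforehand on a different
dataset.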

6 Discussion and Future Work 

In this paper we tested the current visual recognition capabilities of a humanoid robot,
the iCub. Our analysis addressed the generic question “How many objects can iCub
(currently) recognize?”, which was then formulated more accurately as the problem
of determining the maximum number of objects that state-of-the-art visual recognition
systems can recognize with (virtually) perfect accuracy.

We identified a natural human-robot interaction application as a possible testbed
for our investigation of the visual recognition problem. In order to foster the
reproducibility of our experiments, we collected a novel dataset within this scenario,
iCubWorld28, comprising images depicting 28 object classes and acquired over the
course of 4 days.

We approached the problem by first defining a performance measure that would
allow us to operatively quantify our confidence that results observed off-line on
iCubWorld28 would then generalize to the real application. We then identified multiple
aspects of the robotics context that could be leveraged to improve the overall
recognition capabilities of the otherwise purely visual system. In particular, we
empirically observed that exploiting the temporal consistency of subsequent frames in
the visual stream or adopting weakly-supervised strategies to reduce the amount of
distractors in the image can be extremely beneficial. Following these principles we were
able to provide a preliminary answer to the original question. Our results show on one
hand that modern visual representation architectures such as CNNs are finally able to
address visual recognition in robotic settings, but on the other hand they point out
that the problem is extremely challenging and far from being solved.
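
As a simple illustration of how temporal consistency across subsequent frames can be
exploited, the sketch below smooths per-frame predictions with a sliding-window majority
vote. Majority voting, the window length and the function name `smooth_predictions`
are assumptions chosen for this example and are not necessarily the scheme adopted in
our experiments.

```python
from collections import Counter, deque

def smooth_predictions(frame_labels, window=5):
    """Replace each prediction with the most frequent label in the last frames."""
    recent = deque(maxlen=window)  # holds at most `window` recent predictions
    smoothed = []
    for label in frame_labels:
        recent.append(label)
        # The most common label in the current window wins.
        smoothed.append(Counter(recent).most_common(1)[0][0])
    return smoothed

# Hypothetical per-frame predictions with one isolated mistake.
raw = ["cup", "cup", "dog", "cup", "cup"]
print(smooth_predictions(raw, window=3))
```

With a window of three frames the isolated error in the third frame is voted out, at
the cost of a short delay when the observed object actually changes.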